Plasma-Derived HIV-1 Virions Contain Considerable Levels of Defective Genomes

ABSTRACT Genetically-characterizing full-length HIV-1 RNA is critical for identifying genetically-intact genomes and for comparing these RNA genomes to proviral DNA. We have developed a method for sequencing plasma-derived RNA using long-range sequencing (PRLS assay; ∼8.3 kb from gag to the 3′ end or ∼5 kb from integrase to the 3′ end). We employed the gag-3′ PRLS assay to sequence HIV-1 RNA genomes from ART-naive participants during acute/early infection (n = 6) or chronic infection (n = 2). On average, only 65% of plasma-derived genomes were genetically-intact. Defects were found in all genomic regions but were concentrated in env and pol. We compared these genomes to near-full-length proviral sequences from paired peripheral blood mononuclear cell (PBMC) samples for the acute/early group and found that near-identical (>99.98% identical) sequences were identified only during acute infection. For three participants who initiated therapy during acute infection, we used the int-3′ PRLS assay to sequence plasma-derived genomes from an analytical treatment interruption and identified 100% identical genomes between pretherapy and rebound time points. The PRLS assay provides a new level of sensitivity for understanding the genetic composition of plasma-derived HIV-1 RNA from viremic individuals either pretherapy or after treatment interruption, which will be invaluable in assessing possible HIV-1 curative strategies. IMPORTANCE We developed novel plasma-derived RNA using long-range sequencing assays (PRLS assay; 8.3 kb, gag-3′, and 5.0 kb, int-3′). Employing the gag-3′ PRLS assay, we found that 26% to 51% of plasma-derived genomes are genetically-defective, largely as a result of frameshift mutations and deletions. These genetic defects were concentrated in the env region compared to gag and pol, likely a reflection of viral immune escape in env during untreated HIV-1 infection. Employing the int-3′ PRLS assay, we found that analytical treatment interruption (ATI) plasma-derived sequences were identical and genetically-intact. Several sequences from the ATI plasma samples were identical to viral sequences from pretherapy plasma and PBMC samples, indicating that HIV-1 reservoirs established prior to therapy contribute to viral rebound during an ATI. Therefore, near-full-length sequencing of HIV-1 particles is required to gain an accurate picture of the genetic landscape of plasma HIV-1 virions in studies of HIV-1 replication and persistence.

distance (APD). The percentages of genetic diversity of the sequences from each dilution for each participant were similar, further indicating that there was no sequencing-related bias within the dilutions (Table 1).
Defective HIV-1 genomes are present in plasma-derived virions. It has generally been assumed that the majority of virions within the plasma contain genetically-intact HIV-1 RNA genomes. To investigate the number of genetically-defective and intact HIV-1 RNA genomes within the plasma of untreated HIV-1-infected participants, we used the gag-39 PRLS assay to sequence HIV-1 RNA from the plasma of eight ART-naive HIV-1-infected participants. The plasma samples were collected during acute/early infection and chronic infection for participants P1 to P6 and participants P7 and P8, respectively. Participant characteristics and numbers of sequences amplified are shown in Table 2. We found an overall median of 65% (interquartile range [IQR], 63 to 69%) of plasma RNA genomes were genetically-intact in the gag-39 region. When the proportion of genetically-intact RNA sequences from the plasma of the acute/early participants was compared with that of the chronically-infected participants, we found that there was no significant difference between these participant groups. The acute/early group had a median of 66% (IQR, 63 to 70%) of genetically-intact genomes, compared to a median of 58% (IQR, 53 to 62%) for the chronic group ( Overall, we found that frameshift mutations were the most common type of genetic defect found, making up an overall median of 54% (IQR, 42 to 59%) of the total number of defective plasma-derived genomes amplified, followed by deletions of .100 bp (median, 37%; IQR, 29 to 51%) (Fig. 3B). Other defects such as single nucleotide polymorphisms (SNPs) causing premature stop codons and hypermutations had a lower frequency (median, ,5.5%; range, 0 to 11%). When comparing the data for the acute/ early versus chronic participants, we found weak evidence that the proportion of sequences containing a premature stop codon was higher in the chronically-infected participants than in the acute/early participants (Fig. 3C) (P = 0.04). We found no statistically significant difference in the proportion of sequences with any other type of defect between these two participant groups (Fig. 3C) (P . 0.32).
We also investigated whether these genetic defects are concentrated in any HIV-1 open reading frame (ORF). When pooling only the full-length gag-39 sequences (length, .8,250 bp) across all participants, we found that the total number of defects (frameshift and stop codons) was highly concentrated in env (49.5% of defects), followed by pol (29.2% of defects) ( Table 3). When the number of defects in each gene was taken as a proportion of the gene length to account for the differing lengths of the HIV-1 ORFs, the most defects were still found in env, followed by pol (Table 4).
To further investigate whether gag, pol, or env contained more defects in full-length (.8,250 bp) sequences, we adjusted the number of frameshifts or stop codons to the number of genomes analyzed in that gene region per participant ( Fig. 3D and E). We   found that gag had a lower mean value for defects/genome than env (Fig. 3D) (P = 0.032). We also observed that the mean defects/genome were lower in gag than in pol (0.045 and 0.104 defects/genome, respectively), though this did not reach statistical significance (Fig. 3D) (P = 0.328). When comparing acute/early infection and chronic infection, we observed a trend toward chronically-infected participants having a higher mean number of defects/genome in the env region than acute/early participants FIG 3 Types of defects present in the plasma of ART-naive HIV-1-infected individuals. gag-39 RNA genomes were amplified using PRLS and sequenced from the plasma of 8 ART-naive HIV-1-infected participants. The acute/early participants (n = 6) are depicted by circles, and the chronic participants are depicted by squares (n = 2). (A) Proportion of total plasma-derived gag-39 sequences that were genetically-intact in the acute/early group versus the chronic group. (B and C) Proportions of total genetically-defective plasma-derived gag-39 genomes with different types of genetic defects: acute/early and chronic groups combined (B) or separated (C). Data are represented as the median 6 IQR. (D and E) For full-length plasma-derived gag-39 genomes that lack deletions .100 bp in size, the number of defects (premature stop codons and frameshift mutations combined) found in gag, pol, or env was divided by the number of full-length genomes amplified per participant. This was compared between these three genomic regions for the acute/early and chronic participant groups combined (D) and separately (E). Data are represented as the mean 6 SD. *, P , 0.05; **, P , 0.01; ***, P , 0.001; n.s., P . 0.05.
( Fig. 3E) (P = 0.147). However, there was no difference in the number of defects/genome in gag and pol between the two participant groups (P . 0.9 for both). We note that the gag-39 PRLS assay does not amplify the first nine codons of gag, meaning that we could miss changes in the gag start codon that render a genome defective. Interestingly, out of 82 total sequences with deletions (.100 bp) identified in the plasma from the eight ART-naive participants (Table 2), we found that seven (8.5%), sequenced from five different participants, had deletion breakpoints that corresponded to the splice donor site D4 (HXB2 positions 6043 to 6053) and acceptor site A7 (HXB2 positions 8359 to 8381) (42). Splicing at these sites joins tat and rev exons 1 and 2. In addition, five sequences were identified with 39 deletion breakpoints corresponding to known splice acceptor sites (42), even though the 59 breakpoints of some did not correspond to a known splice donor site.
Genetic diversity of plasma genomes during untreated HIV-1 infection. Studies sequencing shorter HIV-1 RNA fragments have revealed that during untreated acute/ early infection, HIV-1 RNA is genetically homogenous (8,22,(43)(44)(45)(46). To determine if the genetic diversity of near-full-length genetically-intact sequences is different during acute/early and chronic infection, we calculated the genetic diversity of gag-39 plasma RNA sequences for each participant using average pairwise distance (APD). Similar to these earlier findings (8,22,(43)(44)(45)(46), the genetically-intact HIV-1 RNA genomes within the majority of the acute/early participants were less genetically diverse than those of the chronic participants (acute/early group = 0.05 to 1.72%, chronic group = 1.11 to 2.25%) (Table 5). However, the APD of the genetically-intact HIV-1 RNA genomes derived from participant P2, whose plasma sample was collected during early infection, was 1.72% (Table 5). This high genetic diversity and the topology of the phylogenetic tree observed for this participant (Fig. 4B) are consistent with previous reports of early infection with multiple variants (7,8,22). Participant P7, whose sample was collected during chronic infection, also had an extremely diverse population of genetically-intact genomes isolated from their plasma (APD = 2.25%) ( Table 5). Although genetic diversity is expected to increase during untreated chronic infection (8,22,45,46), the phylogenetic tree of this participant indicates infection with multiple variants, as shown by the distance between branches of the viral sequences on the tree (Fig. 2C). The presence of identical, or monotypic, plasma-derived genomes suggests that these genomes are part of a clone and were released as part of the same replication event, perhaps from the same infected cell or multiple infected cells with identical proviruses (46). We therefore investigated the proportion of genomes that are part of a clone within the plasma during untreated HIV-1 infection by using the gag-39 assay. For this analysis, identical sequences that are part of a clone are .99.98% identical, which allows a difference of 1 nucleotide (nt) between full-length gag-39 genomes. This was to allow for differences  introduced during a single round of viral replication (47) or assay-related error. We identified plasma clones in the gag-39 region in 5 of 8 participants (Table 5; Fig. 2C and Fig. 4 to 6). The proportion of plasma sequences that were part of a clone was low, ranging from 9.8% to 32%. As expected, the participants in earlier stages of untreated infection (Table 2), or those with lower genetic diversity, were more likely to have plasma clones that were part of only 1 or 2 clonal clusters. However, for P8, whose sample was collected during    chronic infection, a high proportion of clonal plasma sequences within the gag-39 region was identified (Table 5). These clonal sequences were part of 4 separate clonal clusters, 2 of which were found in regions of the phylogenetic tree with low genetic diversities of 0.03% (Table 5; Fig. 6). These results indicate that at least 4 different cells or groups of expanded cells contributed to viremia in this participant, similar to what has been identified previously during ART or ART interruption (48,49). Importantly, across all participants, only 2 of 250 (0.8%) defective gag-39 plasma genomes, both derived from participant P8, were 99.98% identical to a genetically-intact plasma genome, indicating that most errors are unique. Similarly, no sequences with deletions, including those with breakpoints coinciding with the D4 splice donor and A7 splice acceptor, were 99.98% identical to another short sequence. This indicates that these partially-spliced genomes could result from different replication events.
To investigate whether short HIV-1 RNA regions are predictive of the clonality of the full-length HIV-1 genome, we trimmed sequences amplified using the gag-39 PRLS assay to the primer binding sites for env (V1-V3; ;800 bp) and p6-RT (;1.3 kb), which have been commonly used to sequence plasma HIV-1 RNA genomes (8,17,50,51). We identified the proportion of clonal genomes within these trimmed alignments and compared that to the proportion of clonal genomes in the full gag-39 alignments. For simplicity, we identified clonal genomes as those that were 100% identical to another genome for this analysis. The proportion of clonal sequences was much higher for the env and p6-RT primer sets than for gag-39 for the two participants analyzed (Table 6). For participant P6, the proportion of clonal genomes increased from 13.6% to 82.4% when the gag-39 sequences were trimmed to the env primers, and to 65.7% when the sequences were trimmed to the p6-RT primers. For participant P7, there were no clonal gag-39 sequences, but when these sequences were trimmed to the env primer binding sites, 21% of sequences were found to be clonal, and 15% of sequences were clonal when trimmed to the p6-RT primer binding sites. This indicates that the gag-39 primer set has a better capacity to correctly identify clonal genomes.

FIG 6
Phylogenetic trees of pre-ART plasma-derived sequences for participant P8. All pre-ART plasma-derived gag-39 sequences were aligned, and phylogenetic trees were prepared for participant P8. Inner rings show whether a sequence is genetically-intact or defective for each type of defect shown. Outer rings show whether a sequence is .99.98% identical to another sequence on the tree. All sequences on the tree are plasma-derived, as indicated by the triangles. The genetic diversity of genetically-intact sequences within the most homogenous parts of the tree are shown between the dark gray lines. Branches with bootstrap values of $75 are indicated by an asterisk beside the node.
Longitudinal sample. To investigate if the genetic landscape of HIV-1 RNA virions changes over time during untreated chronic infection, we used the gag-39 PRLS assay to sequence HIV-1 RNA genomes from two different plasma samples for P7, collected 27 days apart (time point 1 [TP1] and TP2). The percentages of sequences identified as genetically-intact were similar between the two time points (49% versus 54%), and the percentages of the different defects also remained similar between the two time points (Fig. 7A). The genetic diversity of intact sequences within the gag-39 region did not change (2.3% for TP1, 2.2% for TP2). Similarly, the genetic diversity of intact sequences did not change substantially within specific gene regions (gag, TP1 = 1.73% and TP2 = 1.42%; pol, TP1 = 1.75% and TP2 = 1.59%; env, TP1 = 3.46% and TP2 = 3.4%). The sequences from the two time points were similar to one another and intermingled within the phylogenetic tree (Fig. 7B). This was confirmed using a test for panmixia, which indicates the probability that two groups of sequences come from two genetically distinct populations (52,53). We used a threshold of P , 10 23 to indicate a significant genetic difference between the two groups (52,54). We found no evidence of compartmentalization between the sequences from these two time points (P . 0.2). However, no identical gag-39 sequences were identified across TP1 and TP2 (Fig. 7B, inner ring), indicating that despite the genetic similarity of the plasma virions, the continued replication during untreated infection has changed the genomic landscape between these two time points. Interestingly, P7 appeared to be infected with multiple variants at TP1 (Table 5; Fig. 2C), and these variants were also represented in the sequences from TP2 (Fig. 7B).
Plasma sequences are similar to proviral PBMC sequences during untreated HIV-1 infection. Comparing plasma-derived HIV-1 RNA genomes of viremic participants to proviral sequences obtained from cellular compartments may indicate the cellular source of this viremia. For participants P1 to P6, whose samples were collected during untreated acute/early infection, PBMC samples were available from the same blood draw as the plasma samples. We therefore used FLIPS (38) to genetically-characterize the proviruses present in the peripheral blood of these participants. We then compared these proviruses to the plasma-derived genomes from the same time point. For simplicity, we compared only full-length sequences for this analysis (sequences lacking deletions of .100 bp). The number of sequences amplified and the number of PBMCs analyzed per participant are shown in Table 2.
We next investigated whether identical sequences could be found between the plasma and PBMC compartments, setting the threshold for genetic identity at 99.98%, as described above. Genetically-identical sequences between the plasma and PBMCs in the gag-39 region were identified in the four participants whose blood draw was taken during untreated acute infection (P3 to P6), with no identical sequences identified for the early infection participants (P1 and P2) ( Table 7). For the four acute infection participants, between 1 and 42 plasma-derived RNA sequences were .99.98% identical to at least one proviral DNA sequence and represented 2 to 34% of the plasma-derived RNA sequences. For all participants, the identical sequences between the plasma and PBMC compartments were found in a single cluster of .99.98% sequences. Interestingly, 3 out of 4 of these near-identical sequence clusters contained multiple plasma and proviral sequences that were 100% identical in the gag-39 region (Table 7), with most other sequences in the cluster differing from these by 1 nucleotide. Importantly, we found that none of the defective plasma-derived genomes were .99.98% identical to a proviral sequence. For P4, FLIPS analysis of two proviral genomes revealed that these sequences were genetically-intact and identical in the gag-39 region but had changes outside of this region (Fig. 5B). One of these proviral sequences had a SNP outside of the binding site for the inner gag-39 reverse primer but was still genetically-intact. The other proviral sequence had a deletion in stem-loop (SL) 2-3 of the cis-acting region, rendering the genome defective in the FLIPS 9.2-kb region but genetically-intact in the gag-39 region used to sequence plasma-derived HIV-1 RNA (8.3 kb). Additionally, one proviral sequence from P6 with a premature stop codon was 99.98% identical to several genetically-intact proviral and plasma sequences in the gag-39 region (Fig. 5D).
Although the number of identical sequences found within the paired plasma and PBMC samples was low, there did not appear to be any compartmentalization of the sequences derived from the PBMCs and plasma during untreated HIV-1 infection. For   all participants, we observed that sequences from the PBMCs and plasma appeared to be interspersed on their phylogenetic trees ( Fig. 4 and 5). This result was also supported by panmixia analysis showing a lack of genetic compartmentalization between the genetically-intact plasma-derived RNA and proviral DNA sequences within all participants (P . 10 23 ) (52,53). Genetically mapping persistent HIV-1 within pre-ART and ATI samples. Previous studies sequencing short regions of the HIV-1 RNA genome have indicated that identical proviral sequences contribute to viral rebound during ART interruption (15,17). To genetically-characterize a longer region of the rebound viral genomes, we used the int-39 PRLS assay to amplify and sequence plasma-derived RNA genomes in ATI samples from three of the participants who initiated ART during acute infection (P3 to P5) ( Table 2). These participants continued ART for 1 year before interrupting ART three times over the course of the following year (55). The collection time, viral loads, and number of amplified int-39 sequences for the ATI plasma samples are shown in Table 2 and Fig. 8. We used the int-39 PRLS assay to sequence plasma-derived genomes from these ATI plasma samples with lower viral loads to maximize the number of amplicons available for comparison to pre-ART sequences. To compare the int-39 sequences amplified from the ATI plasma to the gag-39 sequences amplified from pre-ART plasma and FLIPS sequences amplified from pre-ART PBMCs, we trimmed all sequences to the ;4.7-kb region overlapping the gag-39 and int-39 amplicons (referred to as trimmed int-39 region).
For all three participants, we observed that the proportion of 100% identical sequences within the trimmed int-39 region was higher in the ATI plasma sequences than in the pre-ART plasma sequences, increasing from a median of 13% (IQR, 9.2 to 17.3%) in the pre-ART plasma to 30.8% (IQR, 22.3 to 42.2%) in the ATI plasma, though this did not reach statistical significance (P = 0.25) ( Fig. 9A; Table 8). We also observed a trend that the proportion of trimmed int-39 sequences that were .99.97% identical (allows 1 nt change between sequences) was higher in the ATI plasma sequences than in the pre-ART plasma sequences (pre-ART mean = 29.5%, ATI mean = 63.3%, P = 0.09) ( Fig. 9B; Table 8). This shows that a large proportion of rebounding virions present during ART interruption are genetically-identical within the sequenced region, indicating that proliferating cells are contributing to the rebounding virus during an ATI.
Considering all three participants together, within the trimmed int-39 region, the proportion of plasma-derived sequences that were genetically-intact did not differ significantly between the ATI plasma-derived sequences and the pre-ART plasma-derived sequences (P = 0.26) ( Fig. 9C; Table 8). The mean proportion of genetically-intact plasma sequences in the pre-ART plasma for the three participants was 77% (95% CI, 70.1%, 83.9%), and the mean for the ATI plasma-derived sequences was 87% (95% CI, 79.1%, 94.9%). For all participants, the most common type of genetic defect was a frameshift mutation (Table 2). However, participants P3 and P5 did have a considerable increase in the proportion of genetically-intact plasma-derived sequences in the trimmed int-39 region between the pre-ART plasma sequences and ATI plasma sequences, increasing from 70% to 90% for participant P3 and from 80% to 92% for participant P5 (Fig. 9C). The genetic diversity of genetically-intact sequences within the trimmed int-39 region also did not change substantially between pre-ART and ATI plasma genomes (Table 8).
When we compared the pre-ART plasma and PBMC sequences to ATI plasma sequences for each participant, we found ATI plasma-derived sequences that were 100% identical to a pre-ART plasma sequence within the trimmed int-39 region for all three participants (Fig. 10A to C). All sequences we identified as 100% identical to another sequence in the trimmed int-39 region were genetically-intact in this region, though some were found to have genetic defects outside of this region ( Fig. 10A to C). For participants P3 and P5, we found examples of clusters of multiple ATI plasma sequences that were 100% identical to a pre-ART plasma sequence ( Fig. 10A and C). These pre-ART plasma sequences were genetically-intact in the full ;8.3-kb gag-39 sequenced region. Participants P3 and P4 also had an ATI plasma sequence that was After the indicated number of weeks, the participants reinitiated ART and then interrupted ART for a third time (phase B3). Participants P3 and P5 reinitiated ART after the indicated number of weeks. The ATI plasma sample used for the PRLS assay int-39 amplification is indicated by a red dot and text, with the viral load at this time noted. The limit of detection (LOD) is ,50 copies/mL. 100% identical within the trimmed int-39 region to a cluster containing both pre-ART plasma and PBMC sequences ( Fig. 10A and B). For participant P3 (Fig. 10A), two of the pre-ART plasma sequences and two of the pre-ART PBMC sequences that contributed to this cluster were also found to be genetically-intact and 100% identical in their respective near-full-length sequences (i.e., 8.3-kb gag-39 for plasma-derived RNA sequences and 9.2 kb for FLIPS PBMC proviral sequences). For participant P4 (Fig. 10B), two of the pre-ART plasma sequences were 100% identical to six of the pre-ART PBMC sequences in the gag-39 region, four of which were also 100% identical and genetically-intact in the 9.2-kb FLIPS region. For this participant, two of the proviral sequences had defects outside the trimmed int-39 region (Fig. 10B). Together, these findings indicate that many virions sequenced during an ATI are genetically-intact within the trimmed int-39 region. Several ATI viral sequences were genetically-identical to viral sequences from pre-ART plasma and PBMC samples, indicating that HIV-1 reservoirs established prior to therapy can contribute to rebound virus.

DISCUSSION
We have developed a novel method for near-full-length sequencing of plasma HIV-1 RNA by employing an SGS approach to amplify multiple individual ;8.3 kb RNA genomes. We used the gag-39 plasma-derived HIV-1 RNA using long-range sequencing (PRLS) assay to investigate the genetic composition of HIV-1 RNA within the plasma of eight ART-naive participants. In conducting this study, we found that 26 to 51% of viral particles present in the plasma of untreated HIV-1-infected individuals are geneticallydefective.
Our studies of the HIV-1 RNA sequences from the culture supernatant of reactivated J-Lat cells revealed that the PRLS assay has a very low error rate of 0.0076%, or ;0.63 nucleotides/gag-39 genome. We found a higher rate of frameshift mutations when sequencing HIV-1 gag-39 RNA genomes from the plasma of untreated participants compared to the gag-39 sequences from the supernatant of J-Lat cells. In both the plasma of untreated HIV-1-infected participants and the J-Lat culture supernatant, the majority of these frameshift mutations were the result of 1-or 2-bp insertions or deletions (indels) in a homopolymer region of the HIV-1 genome. Indels within homopolymer regions can occur as a result of reverse transcriptase (RT)-PCR or sequencing error (56-59), though these indels can also be introduced during the HIV-1 replication cycle FIG 9 Properties of plasma-derived sequences from pre-ART and ATI plasma samples are similar. Plasma-derived sequences from pre-ART and ATI plasma samples for participants P3, P4, and P5 were trimmed to the ;4.7-kb region spanning int-39. (A and B) The proportions of sequences that were 100% (A) or .99.97% (B) identical to another sequence in the same sample were compared between the pre-ART and ATI plasma samples. A statistical comparison was performed using a Wilcoxon test and a paired t test, respectively. (C) The proportions of trimmed int-39 plasma-derived sequences that are genetically-intact within the trimmed int-39 region were compared between the pre-ART and ATI plasma samples. A statistical comparison was performed using a paired t test. ns, P . 0.05.   (47,(60)(61)(62)(63). This indicates that the increased rate of frameshifts observed in the HIV-1 RNA from these participants is most likely the result of a combination of assay-related and viral replication errors. Since these mutations can be introduced during the HIV-1 replication cycle, we consider them genetic defects within the plasma-derived HIV-1 RNA. This is the first study to quantify the proportion of genetically-intact and defective plasma-derived HIV-1 RNA genomes at a single-molecule level in untreated HIV-1 infection. We found that a median of 65% of plasma-derived HIV-1 genomes are geneticallyintact during untreated infection, leaving 26 to 51% of plasma-derived genomes genetically-defective and therefore likely not infectious. The presence of defective proviral genomes in HIV-1-infected cells has been well-documented, both in untreated HIV-1 FIG 10 Phylogenetic trees of genomes from pre-ART plasma and PBMC samples and ATI plasma samples. Plasma-derived sequences from pre-ART and ATI plasma samples for participants P3, P4, and P5 were trimmed to the ;4.7-kb region spanning int-39 and aligned with trimmed proviral sequences from pre-ART PBMC samples, with sequences hypermutated or deleted in the trimmed int-39 region excluded. Phylogenetic trees were then prepared for participants P3 (A), P4 (B), and P5 (C). PBMCderived sequences are represented by circles, while plasma-derived sequences are represented by triangles. Inner rings show whether the sequence is from a pre-ART sample or an ATI sample. The second rings show whether a sequence is genetically-intact, or defective for each type of defect shown, within the full region sequenced for that sample (i.e., 8.3 kb gag-39 for pre-ART plasma and 9.2 kb FLIPS for pre-ART PBMC). The third rings show whether a sequence is genetically-intact or defective within the trimmed int-39 region. The outer rings show whether a sequence is 100% identical to another sequence in the tree in the trimmed int-39 region. Branches with bootstrap values of $75 are indicated by an asterisk beside the node.
infection (64)(65)(66)(67) and during effective ART (38,64,68), but it has generally been assumed that nearly all HIV-1 RNA genomes within plasma virions are genetically-intact. This is despite the fact that defective genomes have been observed when sequencing subgenomic regions of HIV-1 RNA from the plasma (19,(26)(27)(28)(29)69) and that ex vivo measurements of infectivity have revealed that most virions present in vivo are not infectious (70)(71)(72). The high proportion of defective genomes identified in our study using gag-39 PRLS is consistent with the low level of infectious viral particles identified in these ex vivo measurements (1 tissue culture infective dose [TCID 50 ] per 11,550 viral RNA copies) (71). Importantly, the defective genomes we have identified may contribute to the production of decoy HIV-1 proteins, which are known to be expressed from defective proviruses (73,74). The host CD8 1 T cell response to these decoy proteins contributes to HIV-1 immune evasion, chronic immune activation, and T cell exhaustion during ART (74,75). These defective genomes may therefore have a pathogenic effect despite their presumed lack of infectivity. Recent studies sequencing subgenomic regions of plasma HIV-1 RNA have revealed that these sequences can match both genetically-intact and defective proviral sequences from HIV-1-infected cells (25,64). This, in conjunction with our finding that defective genomes are present in the plasma of untreated HIV-1-infected individuals, indicates the importance of sequencing near-full-length HIV-1 RNA.
To ensure that our PRLS assay produced consistent results, we analyzed the plasma-derived gag-39 HIV-1 RNA sequences in samples from two time points from the same participant during untreated chronic infection (P7), taken 27 days apart. The proportion of genetically-intact sequences and the proportion of each type of genetic defect were very similar between the two time points, and there was no evidence of sequence compartmentalization between the two time points. The viral genetic diversity has been shown to accumulate slowly during untreated chronic HIV-1 infection in more conserved genomic regions such as pol (43), while it may accumulate more quickly in more diverse regions such as env due to differing selection pressures (45). While we did observe a large difference in the genetic diversity of genetically-intact sequences within gag and pol compared to env for both time points, we did not observe a large change in the genetic diversity between TP1 and TP2 for any region. Overall, these results indicate the reproducibility of the PRLS assay and the slow accumulation of genetic diversity over 27 days of untreated chronic HIV-1 infection.
Genetic defects were found in all ORFs analyzed in this study, indicating the importance of considering the entire HIV-1 genome where possible when characterizing viral genomes in the plasma of HIV-1-infected individuals. We did, however, find an uneven distribution of defects across the genome, with a higher number of defects/genome found in pol and env than in gag. The 59 half of the gag RNA genome has been shown to be important during packaging, allowing the selective incorporation of full-length RNA genomes into new virions during productive infection (76,77). This may lead to a lower accumulation of mutations in gag during viral replication. We do acknowledge, however, that the gag-39 PRLS assay does not sequence the first nine codons of gag, and mutations in this region are not identified. Nikolaitchik et al. demonstrated the importance of the first six nucleotides of gag in efficient packaging of RNA genomes by the HIV-1 Gag protein (78). Therefore, even though we did not sequence this region, it is reasonable to assume that this early portion of gag is intact, as the genomes were packaged into virions and released into the plasma. Alternatively, new virions require a functional Gag protein, and it has been demonstrated that Gag alone is able to assemble into a virus-like particle in culture (79). Therefore, the low proportion of plasmaderived genomes with defects in gag may reflect the fact that new virions are not able to be formed without a functional Gag protein. The increased proportion of defects found in env may reflect an increased accumulation of mutations in env in response to immune pressure, such as neutralizing antibodies (80)(81)(82) or CTL escape (83). This increased accumulation of mutations likely would also lead to an increased introduction of defects during viral replication. Further demonstrating this accumulation of mutations in env, we found a significantly higher genetic diversity within the env gene than in gag and pol within the genetically-intact plasma-derived gag-39 sequences in our study of untreated HIV-1 infection. Furthermore, this immune pressure has also been previously shown to lead to mutations in env that can cause the virus to be less replication-competent, indicating that the presence of immune pressure can lead to genetic defects in HIV-1 env (84, 85). These findings highlight how sequencing subgenomic regions can greatly affect the observed genetic diversity and the presence of mutations within plasma-derived virions but not represent the true distribution of mutations across the genome.
We sequenced gag-39 plasma-derived genomes with deletions of .100 bp, and 8% of these had break sites corresponding to the splice donor site D4 and splice acceptor site A7, which join tat and rev exons 1 and 2 in multispliced RNA transcripts (42). This suggests the presence of spliced RNA in plasma-derived virions, which has been demonstrated by in vitro transfection studies (86)(87)(88). However, this splicing of D4 to A7 does not occur without prior splicing of splice donor D1 to another downstream acceptor site (42,89,90). The deleted sequences we observed from the plasma reflect splicing of D4 to A7 alone, which could indicate that the presence of these genomes in the plasma are a result of another mechanism, such as extensive RNA secondary structure around the splice sites (91)(92)(93)(94).
By employing the gag-39 PRLS assay, we observed a relatively low proportion of identical plasma-derived RNA genomes during untreated HIV-1 infection compared to that of studies which sequenced subgenomic regions of the HIV-1 genome (8,22,46). However, we did observe identical plasma-derived RNA genomes for those participants who were at an earlier stage of infection or whose genetically-intact plasma-derived sequences had an overall lower genetic diversity. This indicates that a lower genetic diversity increases the likelihood of identifying identical sequences. We also found that identical genomes between the plasma and PBMC compartments during untreated infection were rare, though they were more common in participants that were in acute infection than in early infection. Despite the low number of identical sequences identified between the plasma and PBMCs, we found no evidence of compartmentalization between the two sites. The lack of identical sequences between these two sites could be attributed to the rapid turnover of HIV-1-infected cells during productive infection (95)(96)(97), an alternative anatomic source of some plasma virions (53,96,98), the depth of our sequencing limiting our ability to identify identical genomes, the sampled PBMCs representing only a low proportion of the circulating lymphocytes present at the time of blood draw (99), or the rapid introduction of mutations during active replication leading to multiple nucleotide differences between the RNA and DNA HIV-1 sequences.
For three participants whose peripheral blood samples were collected during untreated acute infection (P3, P4, and P6), we identified full-length (.8.25-kb) monotypic plasma-derived sequences that were 100% identical to at least two proviral sequences from their cells, and these identical sequences were genetically-intact within the sequenced gag-39 region. For one participant (P6), we observed a large monophyletic group of 24 sequences which contained 12 monotypic RNA sequences and 12 identical proviral/DNA sequences. As the majority of cells in the peripheral blood contain one integrated HIV-1 DNA molecule, these findings suggest that multiple cells harboring genetically-intact and genetically-identical proviruses are contributing to plasma viremia (46). We cannot, however, exclude the possibility that these findings are the result of HIV-1-specific cells proliferating during ongoing viral replication (100). When we set the threshold for genetic identity at 99.98%, which allows a difference of 1 nt between full-length genomes, genetically-identical sequences between the plasma and PBMCs in the gag-39 region were identified in all four participants whose peripheral blood samples were collected during acute infection. In identifying sequences that were .99.98% identical between the plasma and PBMCs in the gag-39 region, we found only one case where a genetically-intact plasma sequence was 99.98% identical to a defective PBMC sequence (containing a premature stop codon in pol). No examples of defective plasma sequences being .99.98% identical to a fulllength PBMC sequence in the gag-39 region were identified. This indicates the importance of identifying genetically-intact HIV-1 RNA genomes for comparison with proviral genomes when investigating which HIV-1-infected cells contribute to viremia. To our knowledge, this is the first study to use near-full-length single HIV-1 plasma-derived RNA sequences to identify 100% or near-identical sequences between the HIV-1 plasma and proviral compartments.
For three participants who initiated ART during acute infection, we employed the int-39 PRLS assay to compare plasma-derived RNA sequences from an ATI time point to pre-ART plasma-derived and proviral sequences trimmed to this region. Using the int-39 PRLS assay for these ATI time points, we were able to maximize the number of sequences available for comparison with the pre-ART plasma-derived and proviral genomes. We observed that all plasma-derived RNA genomes from the ATI plasma that were 100% identical to another ATI plasma sequence or a pre-ART plasma or PBMC sequence were genetically-intact in the trimmed int-39 region. However, it is important to note that the proportion of genetically-intact plasma-derived genomes in the pre-ART plasma increased from an average of 65% for these three participants to an average of 77% when the sequences were trimmed to the ;4.7-kb int-39 region, due to the mutations that rendered the genomes defective being found outside of the trimmed int-39 region. Future studies will therefore require sequencing of ATI plasma samples using the gag-39 amplicon to elucidate the contribution of virions containing genetically-defective genomes to viral rebound upon ART interruption.
Within the trimmed int-39 region, we identified plasma-derived RNA sequences from ATI plasma samples that were 100% identical to pre-ART plasma-derived RNA sequences from all 3 participants. This is consistent with previous studies that have indicated that the viral quasispecies present pretherapy is similar to the quasispecies found during ART interruption (16,54). Kearney et al. (15) also found examples of pre-ART plasma-derived sequences that were 100% identical to rebound plasma-derived sequences in a smaller region of the HIV-1 genome. These sequences were also found to be identical to variants that expressed cell-associated HIV-1 RNA during ART suppression, indicating their persistence and transcriptional activity during ART (15). Furthermore, we have identified plasma-derived RNA sequences in 2 of 3 participants from the ATI time point that were 100% identical within the trimmed int-39 region to several pre-ART proviral and pre-ART plasma-derived sequences. Of note, a subset of these pre-ART sequences were 100% identical and genetically-intact in the longer regions sequenced using the gag-39 PRLS assay and FLIPS. This suggests that cells integrated with this proviral sequence have persisted during ART and have contributed to viral rebound upon ART interruption, though we have not confirmed the presence of this sequence in proviral sequences obtained from samples during ART suppression. The contribution of proliferating cells to viral rebound during ART interruption and residual viremia during ART has been described previously (15,17,48,101) but to our knowledge has not been demonstrated using an amplicon of this size. We note that there were several pre-ART plasma-derived and proviral sequences that were identified as 100% identical to this cluster of identical sequences within the trimmed int-39 region but had mutations outside of the trimmed int-39 region, some of which led to a premature stop codon or deletion in the cis-acting region of the genome. Future studies will require sequencing plasma-derived RNA from ATI samples using the gag-39 amplicon to further elucidate the contribution of identical and intact proviral and plasmaderived sequences from pre-ART samples to rebound virions by determining whether the rebound viral sequences are truly clonal and/or match proviral sequences from proliferating cells.
Taken together, our findings highlight the gag-39 or int-39 PRLS assay as a more definitive method for identifying the cellular source of plasma viremia in studies of HIV-1 replication or rebound than methods that sequence shorter subgenomic regions of the RNA genome. Many ATI studies that sequenced shorter subgenomic regions in HIV-1-infected participants have been unable to identify a definitive cellular or anatomic source of rebound viremia due to low numbers of matching proviral and rebound sequences (25,(102)(103)(104) or that the rebound RNA sequences match proviral sequences from multiple anatomic sites within the small region sequenced (17,105). The PRLS assay will allow a greater understanding of viral rebound during ATI by identifying the rebound RNA sequences that are genetically-intact, and therefore likely to be infectious, and by allowing the determination of the cellular source of these genomes. Sequencing the near-full-length genome from rebound virus provides additional insight into the contribution of recombinants to rebound viremia, as suggested by previous studies (25,102,104,106). This indicates the great potential of the PRLS assay in identifying the cellular and anatomic source of genetically-intact plasma HIV-1 RNA genomes in viremic individuals.
However, we acknowledge several limitations of our study. First, the gag-39 PRLS assay amplifies approximately 90% of the HIV-1 RNA genome, meaning that sequences that are identified as identical in this sequenced region may contain mutations in genomic regions outside of the sequenced area. However, this occurrence is predicted to be rare due to the large portion of the genome that is amplified using the gag-39 PRLS assay. Similarly, sequences identified as identical by using the int-39 PRLS assay may contain mutations in genomic regions outside of the sequenced area. Second, due to the extensive secondary structure in the 59 untranslated region (UTR) of the HIV-1 RNA genome, we were unable to extend the region sequenced using the gag-39 PRLS assay beyond the beginning of gag. This means that we are not able to sequence the cis-acting region of the HIV-1 genome, or the major splice donor (MSD, D1), and therefore determine whether defects in these regions, which are common in proviral sequences during effective ART (38,64,68,107), are present in plasma-derived HIV-1 RNA genomes. Third, as we have not tested the in vitro replication competency of the sequences obtained in this study by viral outgrowth assay, we cannot be sure that the plasma RNA genomes identified as genetically-intact are truly replication-competent. Fourth, although we believe that RNA genomes with frameshift mutations due to small indels at homopolymer regions are present in the plasma, we are unable to distinguish the contribution of assay-related error and true HIV-1 replication-related mutation. Fifth, the gag-39 PRLS assay can be used only on samples with high viral loads, as the instability of RNA molecules makes cDNA synthesis and PCR amplification of the ;8.3 kb length inefficient. We have addressed this limitation by developing the int-39 amplification, which is able to amplify and sequence plasma samples with lower viral loads, such as samples that are ,350 copies/mL. Sixth, we have quantified only the proportion of defective HIV-1 plasma-derived RNA genomes in the near-full-length gag-39 region in untreated participants with very high viral loads. It will be important to determine whether the high proportion of defective genomes remains in plasma samples with lower viral loads or whether the production of these virions with defective genomes is selected against during low-level replication. Finally, our observations regarding viral rebound during ATI were based on only three participants who initiated ART during acute infection, which may have increased our chance of identifying 100% identical sequences between pre-ART and ATI samples, due to the lower genetic diversity of genomes from participants who initiate ART during acute infection (50).
In conclusion, we have developed a new HIV-1 plasma RNA sequencing assay that enables the amplification and sequencing of near-full-length RNA genomes by using primers spanning gag-39 (;8.3 kb) or int-39 (;5 kb). We have used the gag-39 PRLS assay to demonstrate that 26 to 51% of HIV-1 RNA genomes are genetically-defective in the plasma of HIV-1-infected individuals during untreated infection. Importantly, we identified near-identical sequences between the plasma and PBMC compartments in 4 out of 6 acute/early infection participants, and 3 of these participants had 100% identical sequences between these two compartments. By employing the int-39 PRLS assay, we found that the majority of ATI plasma-derived sequences were identical to another sequence and genetically-intact. Several sequences from the ATI plasma samples were identical to viral sequences from pre-ART plasma and PBMC samples, indicating that HIV-1 reservoirs established prior to therapy contribute to viral rebound during an ATI. Our findings highlight the importance of sequencing long-read RNA transcripts at a single-molecule level to gain an accurate picture of the genetic landscape of plasma HIV-1 virions in studies of HIV-1 replication and persistence.

MATERIALS AND METHODS
Participant samples and study approval. Stored plasma samples from 8 ART-naive HIV-1-infected participants were obtained for this study ( Table 2). All participants were infected with HIV-1 subtype B. Six of these participants (P1 to P6), recruited through the Phaedra/Core01 AIEDRP protocol, had their blood draws taken during acute/early HIV-1 infection (1 to 4 months postinfection). For participants P1 to P6, cryopreserved PBMCs were also obtained from the same blood draw as the plasma for comparison (PBMC cell numbers ranging from 306,000 to 1,976,000 cells) ( Table 2). Acute and early infection were defined as described previously (55).
For participants P3 to P5, stored plasma samples taken during an analytical treatment interruption (ATI) were also obtained for this study (Table 2; Fig. 8). Participants initiated ART during acute infection and continued treatment for 1 year. They then underwent three ATIs over the course of the following year, reinitiating ART for a period of time between, as described previously (55). Plasma samples with viral loads of ;1,000 copies/mL were obtained from the first ATI for participant P3 and from the second ATI for participants P4 and P5 (Fig. 8).
Two participants (P7 and P8) had their blood draws taken during chronic HIV-1 infection (.12 months postinfection). P7 also had a second blood draw taken approximately 27 days after the first blood draw (time point 2 [TP2]).
Study approval was obtained from the institutional review board at the Western Sydney Local Health District, which includes the Westmead Institute for Medical Research. Phaedra/Core01 was approved through the St Vincent's Hospital and University of New South Wales (UNSW) institutional review boards. All participants provided written informed consent.
RNA extraction. Up to 1.5 mL plasma was centrifuged at 5,300 Â g for 10 min at 4°C to pellet any cellular debris present in the sample. The supernatant was transferred, and viral RNA extraction was performed as previously described, with minor modifications (18). Briefly, plasma was centrifuged at 21,000 Â g for 1 h at 4°C. The supernatant was removed, and 60 mL Tris-HCl (pH 7.5) containing 200 mg proteinase K was added to the pellet. This was incubated at 55°C for 30 min. Two hundred microliters of 5.8 M guanidium isothiocyanate containing 200 mg glycogen was then added, and the mixture was incubated at room temperature for 5 min. Next, 270 mL 100% isopropanol was added, and the lysate was centrifuged at 21,000 Â g for 15 min. The supernatant was removed, and the pellet was washed with 70% ethanol. The pellet was then dissolved in 35 mL 5 mM Tris-HCl (pH 7.5) containing 1 mM dithiothreitol (DTT).
For plasma samples diluted to viral loads of ,100 copies, a second similar viral RNA extraction protocol was used as described previously (108), with minor modifications. Briefly, plasma supernatant was topped up to 9 mL with Tris-buffered saline, and samples were ultracentrifuged at 185,500 Â g for 35 min at 4°C in a type 70.1 TI rotor (Beckman Coulter). The supernatant was then removed, and 100 mL 5 mM Tris-HCl (pH 7.5) containing 200 mg proteinase K was added to the pellet. Samples were incubated at 55°C for 30 min. Three hundred twenty-five microliters of 5.8 M guanidinium isothiocyanate containing 200 mg glycogen was then added, and samples were incubated at room temperature for 15 min. The lysate was transferred to a microcentrifuge tube, and 495 mL 100% isopropanol was added. The lysate was then centrifuged at 21,000 Â g for 30 min. The supernatant was removed, and the pellet was washed in 70% ethanol and dissolved in 35 mL 5 mM Tris-HCl (pH 7.5) containing 1 mM DTT.
cDNA synthesis. Extracted RNA was split into two 16 mL reactions for cDNA synthesis. For each reaction mixture, the RNA was first treated with ezDNase (Invitrogen), according to the manufacturer's instructions, with volumes of reagents doubled. This was to remove any contaminating DNA that was extracted alongside the RNA. Briefly, 2 mL 10Â ezDNase buffer and 2 mL ezDNase were added to the 16 mL RNA, and the mixture was incubated at 37°C for 2 min. To inactivate the ezDNase, the mixture was then incubated at 55°C for 5 min in the presence of 10 mM DTT.
The cDNA synthesis was adapted from the protocol for the single-copy assay (108). Following DNase treatment, a 21.35 mL mixture consisting of 14 mM HIV-1-specific anchored oligo(dT) primer (7dT) (39) and 2.8 mM deoxynucleoside triphosphate (dNTP) (Invitrogen) was mixed with the RNA. The mixture was incubated for 65°C for 5 min, before it was immediately transferred to ice. A buffer mix consisting of 12 mL 25 mM MgCl 2 and 6 mL 10Â PCR buffer II (Invitrogen) plus 0.2% Tween was then added to the mix, followed by 40 U AffinityScript Multiple Temperature Reverse Transcriptase (Agilent Technologies). The cDNA synthesis cycling conditions were as follows: 42°C for 5 min, 55°C for 1.5 h, 70°C for 15 min, 4°C hold. The two separate reaction wells were pooled prior to amplification of the cDNA.
To quantify the viral load of plasma samples, cDNA synthesis was performed using random hexamers, as adapted from the protocol for the single-copy assay (108). Extracted RNA was treated as described in the section above, with the HIV-1-specific anchored oligo(dT) primer replaced with random hexamers (Promega) at the same concentration. The cycling conditions for the random hexamer cDNA synthesis were as follows: 15 min at 25°C, 42°C for 40 min, 85°C for 10 min, 25°C for 10 min, 4°C hold.
PCR amplification of cDNA. The nested PCR amplification as described previously (38) was adapted to amplify near-full-length genomes from cDNA synthesized from plasma of HIV-1-infected individuals.
DNA extraction and amplification of proviral DNA. Cryopreserved PBMCs from the same blood draws as the plasma samples were lysed using DirectPCR lysis reagent (Viagen Biotech) containing 0.4 mg/mL proteinase K, at an approximate concentration of 1 Â 10 6 cells/100 mL lysis buffer. The lysate was incubated at 55°C overnight (16 h) at 750 rpm using an Eppendorf ThermoMixer, followed by 95°C for 10 min at 750 rpm to inactivate the proteinase K.
Proviral DNA was amplified from lysed PBMCs by the full-length individual proviral sequencing (FLIPS) assay as previously described (38). The amplified products were identified following visualization on a 1% agarose gel, with PCR products of .8.9 kb length selected for sequencing.
Sequencing and assembly of amplified genomes. Amplified cDNA and PBMC positives were prepared for next-generation sequencing on the Illumina MiSeq platform as described previously (38).
HIV-1 genomes were assembled using either a workflow designed in CLC Genomics (Qiagen) as described previously for FLIPS (38) or a custom de novo assembly pipeline as described in a recent publication (109). Briefly, generated reads are first subject to QC trimming using BBDuk v37.98 (https://sourceforge .net/projects/bbmap/). Draft genomes are then assembled de novo using MEGAHIT v1.1.3 (110). Assembly of the genome is confirmed by mapping the reads onto the draft genome using BBMap v37.98. This is visualized using Geneious Prime v.2020.0.3, followed by extraction of the final majority consensus genome.
For genomes amplified from plasma-derived cDNA or lysed DNA from PBMCs, genetically-defective genomes were identified as those with inversions, deletions of .100 bp length, APOBEC3G-induced hypermutation, point mutations causing deleterious stop codons or missing start codons, or small (,100 bp) insertions or deletions (indels) causing frameshift mutations within HIV-1 ORFs (excluding tat exon 2 and nef) (38,(111)(112)(113). In addition, the cis-acting region, encompassing the packaging signal and major splice donor (MSD), was also assessed in the proviral genomes amplified from PBMCs. This was not assessed in plasma-derived genomes, as the PRLS assay does not amplify this region. Genomes with a deletion in the packaging signal or mutated MSD were classified as defective, though sequences with a mutated MSD could be salvaged by the presence of a cryptic splice donor found at HXB2 positions 748 to 749 (38,114). Defects in ORFs were identified using the Los Alamos database tools GeneCutter (https://www.hiv.lanl.gov/content/sequence/GENE _CUTTER/cutter.html) and Hypermut (https://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermut.html), as well as the National Cancer Institute (NCI) Proviral Sequence Database (PSD) Proviral Sequence Annotation & Intactness Test (https://psd.cancer.gov/tools/tool_index.php) (115). For proviral sequences obtained from PBMCs, only sequences .8,800 bp in length, or those lacking deletions .100 bp in size, were included in the study.
Further analysis of sequences. Identical sequences were identified using the Los Alamos tool ElimDupes (https://www.hiv.lanl.gov/content/sequence/elimdupesv2/elimdupes.html) or by manually assessing matrices highlighting the number of differences between sequences, which were constructed from alignments using FastTree version 2.1.11 (116). A threshold of 100% identity was used to identify identical sequences, while near-identical sequences that differed by 1 nt were identified using a threshold of .99.98% identity for gag-39 sequences and a threshold of .99.97% identity for sequences trimmed to the ;4.7-kb region overlapping the gag-39 and int-39 amplicons. Sequences were classified as part of a cluster of .99.98% or .99.97% identical sequences if the sequence differed from another sequence in the cluster by 0 or 1 nt. To compare sequences amplified by PRLS and FLIPS, or to compare the number of clonal sequences identified by PRLS to that identified using SGS of short HIV-1 fragments such as env and p6-RT (8), alignments were trimmed to the primer binding sites yielding the shortest amplicon before additional analyses were performed.
Sequences were visualized using maximum likelihood phylogenetic trees constructed using FastTree version 2.1.11 (116), using the generalized time-reversible model, with gaps/deletions weighted by proportion of nongaps. Branch support values were calculated using 1,000 bootstrap replicates. Phylogenetic trees were visualized using ggTree (117), with the custom script described elsewhere (118). Hypermutated sequences were removed from alignments when phylogenetic trees were constructed. Genetic diversity of geneticallyintact sequences was estimated by average pairwise distance (APD) using MEGA6 software (https://www .megasoftware.net/). APD was calculated when at least five genetically-intact sequences were available (28).
Tests for panmixia indicate the probability that two groups of sequences are two genetically-distinct populations (52,53). We used a threshold of P , 10 23 to indicate evidence for compartmentalization between two different populations of sequences (52,54). For the majority of participant samples, 20 randomly chosen genetically-intact sequences were compared per population, with sequences that were 100% identical excluded.
Determination of error rate. The reporter cell line J-Lat (clone 10.3) (40) was obtained from the NIH AIDS Reagent Program, Division of AIDS, NIAID, NIH. J-Lat cells are latently-infected with HIV-1, harboring one copy of HIV-1 proviral DNA per cell. Latency reversal was induced by incubating 5 Â 10 4 cells with 2 ng/mL phorbol 12-myristate 13-acetate (PMA) for 48 h. The culture supernatant from the activated cells was harvested and stored at 280°C until used for RNA extraction. Following RNA extraction and cDNA synthesis, 94 genomes were amplified and sequenced using the PRLS assay, and SNPs or frameshift mutations were identified in the genomes that were genetically different from the reference sequence for the J-Lat cell line (41). The number of nucleotides involved in these changes was quantified and divided by the total number of nucleotides sequenced (828,705 total nucleotides) to determine the mean and 95% CI for the error rate of the assay.
Viral load quantification. For participants P1 to P6, the viral load was determined by the Roche Amplicor kit 1.5 (Roche Diagnostics, Pleasanton, CA). Viral loads for participants P7 (TP1 and TP2) and P8 were quantified after cDNA synthesis, primed using random hexamers, using a droplet digital PCR (ddPCR) assay specific for the HIV-1 long terminal repeat (LTR) as described previously (119). The threshold was set using the ddpcRquant tool using the default parameters (120).
Dilution of plasma samples. To test the lower limits and reproducibility of the PRLS assay gag-39 amplification, the viral load of participants P6 and P7 was used to dilute these samples so that they ranged from 1,000 to 10,000 copies per dilution, and each diluted sample was extracted and processed using the gag-39 PRLS assay. Similarly, to test the lower limits and reproducibility of the int-39 amplification, the viral loads of participants P6, P7, and P8 were used to dilute these samples to viral loads ranging from 10,000 copies to ,50 copies per dilution. These diluted samples were then extracted using the appropriate extraction method and processed using the int-39 PRLS assay. The plasma was diluted in 1Â PBS (pH 7.4), RNase free (Invitrogen). For each diluted plasma sample, 10 mL of each participant plasma sample was extracted, and the number of viral copies was measured by ddPCR. This determined the actual number of copies of each diluted plasma sample that was sequenced. The lower limit of the assay was determined to be the lowest copy number for which an amplicon was obtained. A simple linear regression was calculated using Prism software (GraphPad) to demonstrate a positive correlation between the number of amplicons and the estimated copy number.
Determination of in vitro recombination rate. To determine the rate of assay-related intertemplate recombination introduced during reverse transcription, plasma samples from two different participants containing approximately 12,000 HIV-1 RNA copies were mixed prior to RNA extraction and gag-39 sequencing. Plasma samples from each of the individual participants were also extracted and sequenced to obtain sequences for comparison to those from the mixed plasma sample. The number of extracted copies in each of these individual participant extractions was measured by ddPCR. From these individual participant extractions, it was determined that 13,304 copies from P6 and 11,001 copies from P7 were added to the mixed plasma extraction.
Two methods were used to investigate whether any intertemplate recombination was introduced during the mixed extraction and reverse transcription. All sequences obtained from the mixed extraction (87 sequences) were aligned to the sequences obtained from the individual extractions (P6, 48 sequences; P7, 56 sequences), and a phylogenetic tree was prepared from this alignment. Any in vitro recombination would be evident by sequences from the mixed extraction appearing separately to, but between, the sequences from the two extractions from the individual participant sequences. Clear separation was observed between the sequences from each participant, and the sequences from the mixed extraction grouped with only the individual participant sequences, indicating no evidence of intertemplate recombination.
We also used the RDP4 tool (121), using the default settings, to look for evidence of recombination between sequences from the mixed extraction and sequences from the individual extractions. A possible recombination event was identified if the event was predicted by at least 5 of 7 methods. Evidence of withinpatient recombination was observed by multiple methods in participant P7, but no evidence of recombination between sequences from P6 and P7 in the sequences from the mixed extraction was observed.
Statistics. Statistical tests were performed using either Jamovi version 1.8.4 (122) (R Core Development Team [2020]), or STATA version 15 (123). A Kruskal-Wallis test was used to compare the proportions of defective plasma-derived sequences with each type of genetic defect, with pairwise comparisons performed using Dunn's test with the Bonferroni multiple comparison correction. Mann-Whitney tests were used to compare the proportions of genetically-intact or defective plasma-derived sequences between the acute/early infection and chronic infection groups. A factorial analysis of variance (ANOVA) was used to compare the numbers of defects in plasma-derived sequences as a proportion of full-length genomes amplified in gag, pol, and env for the acute/early and chronic infection groups. The data were first log transformed (log 10 ), and the normality of the transformed data was confirmed by constructing Q-Q plots (124). Post hoc tests for pairwise comparisons were performed using the Tukey protocol. The proportions of trimmed int-39 sequences that were 100% or .99.97% identical to another sequence within this region were compared between the pre-ART plasma and post-ART plasma using a paired t test and a Wilcoxon test, respectively. The proportions of trimmed int-39 sequences that were genetically-intact in this region were compared between pre-ART and ATI plasma samples using a paired t test. The normality of the data was determined using Shapiro-Wilk tests.
Data availability. All sequences analyzed in this study have been deposited in GenBank under accession numbers OK532372 to OK533278 and OM177064 to OM177172.
(AGRF Sydney Site, the Westmead Institute for Medical Research, Sydney, Australia) for operating the MiSeq, and the Westmead Research Hub (Westmead Institute for Medical Research, Sydney, Australia) for access to droplet digital PCR resources. We acknowledge Gabriel Duette and Bethany A. Horsburgh for their assistance in the preparation of the data for this study. We also acknowledge Ansari Shaik, the Pulse Investigators, and Kate Merlin, Bertha Fsadni, Sri Meka and Julie Jurczyluk from the Clinical Trials & Biorepository Group, St Vincent's Centre for Applied Medical Research, Sydney for their help with participant sample selection. We acknowledge with gratitude the participants involved in this study. Participants with acute and early infection were recruited through the Sydney HIV Primary Infection Network. We declare that no conflict of interest exists.